To illustrate the main stages of the data science lifecycle, I will use data from the Kaggle Walmart store sales forecasting competition.
The ability to accurately predict weekly sales for any retail store is crucial in planning for inventory and maximizing efficiency to improve customer experience and profits.
Below are the detailed instructions for the competition: You are provided with historical sales data for 45 Walmart stores located in different regions. Each store contains a number of departments, and you are tasked with predicting the department-wide sales for each store.
In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data.
#import some of the libraries needed for the project
#Note: bokeh.charts (Line, Scatter, TimeSeries, Bar) is the old high-level
#charts API and requires an older Bokeh release (it was removed in Bokeh 0.13)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from bokeh.charts import Line, Scatter, TimeSeries, Bar, show, output_file
from bokeh.io import output_notebook
output_notebook()
features.csv file description: This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:
features = pd.read_csv("features.csv")
features.head()
stores.csv file description: This file contains anonymized information about the 45 stores, indicating the type and size of each store.
stores = pd.read_csv("stores.csv")
stores.head()
train.csv file description: This is the historical training data, which covers 2010-02-05 to 2012-11-01. Within this file you will find the following fields:
train = pd.read_csv("train.csv")
train.head()
test.csv file description: This file is identical to train.csv, except we have withheld the weekly sales. You must predict the sales for each triplet of store, department, and date in this file.
test = pd.read_csv("test.csv")
test.head()
Split the dates of all tables into year, month and day.
# Splits a column of date strings into separate year, month, and day lists
# parameters: dates - pandas Series containing dates in the format year-month-day
# returns: a tuple of lists in the order (year, month, day)
def split_date(dates):
    year = []
    month = []
    day = []
    for row in dates:
        new_date = row.split("-")
        year.append(int(new_date[0]))
        month.append(int(new_date[1]))
        day.append(int(new_date[2]))
    return (year, month, day)
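As a side note, pandas can do this split in a vectorized way without a Python loop. The sketch below uses a made-up two-row Series (not the competition data): it parses the strings once with `pd.to_datetime` and reads the parts off the `.dt` accessor.

```python
import pandas as pd

# Hypothetical two-row Series standing in for a Date column
dates = pd.Series(["2010-02-05", "2012-11-01"])
parsed = pd.to_datetime(dates, format="%Y-%m-%d")

# The .dt accessor exposes year/month/day as integer Series
year = parsed.dt.year
month = parsed.dt.month
day = parsed.dt.day
```

On the full tables this avoids building intermediate Python lists entirely.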
#Split the dates for the test table into year, month and Day
year, month, day = split_date(test["Date"])
test["Year"] = year
test["Month"] = month
test["Day"] = day
test.head()
#Split the dates for the train table into year, month and Day
year, month, day = split_date(train["Date"])
train["Year"] = year
train["Month"] = month
train["Day"] = day
train.head()
year, month, day = split_date(features["Date"])
features["Year"] = year
features["Month"] = month
features["Day"] = day
features.head()
We also need to determine whether any relationship exists between our sales data and the features stored in the other tables. To make this analysis easier, I will left join the train table with the stores and features tables to add their columns to the training data.
Left join the train (left) table with the features (right) table.
train_data = train.merge(features, on =['Store', 'IsHoliday', 'Year', 'Month', 'Day'], how = 'left')
train_data
Delete the unnecessary duplicate column Date_x (Date_y is kept for the datetime conversion below).
del train_data['Date_x']
train_data.columns
Left join the train_data (left) table with the stores (right) table. This adds the store type and size columns, giving us more features for analyzing sales.
train_data = train_data.merge(stores, on = 'Store', how = 'left')
train_data.head()
Change the column data types to types that allow for easier analysis and manipulation of data
#1. Date column type changed from string to datetime
train_data['Date_y'] = pd.to_datetime(train_data['Date_y'], yearfirst = True)
#2. type changed from string to numeric
train_data[['Store','Dept','Weekly_Sales', 'Temperature']] = train_data[['Store','Dept','Weekly_Sales', 'Temperature']].apply(pd.to_numeric)
train_data[['Fuel_Price','CPI','Unemployment', 'Size']] = train_data[['Fuel_Price','CPI','Unemployment', 'Size']].apply(pd.to_numeric)
train_data[['MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5']] = train_data[['MarkDown1','MarkDown2','MarkDown3','MarkDown4','MarkDown5']].apply(pd.to_numeric)
#3. Change IsHoliday from boolean to binary
holiday = []
types = []
for index, row in train_data.iterrows():
    if row['IsHoliday'] == True:
        holiday.append(1)
    else:
        holiday.append(0)
    #4. Change Type column to numeric values
    if row['Type'] == 'A':
        types.append(1)
    elif row['Type'] == 'B':
        types.append(2)
    elif row['Type'] == 'C':
        types.append(3)
    else:
        types.append(4)
train_data['Holiday'] = holiday
train_data['Type_n'] = types
train_data.head()
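Both encodings can also be done without iterrows, which is much faster on a frame of this size. A sketch on a hypothetical two-row frame (standing in for train_data):

```python
import pandas as pd

# Hypothetical mini-frame standing in for train_data
df = pd.DataFrame({"IsHoliday": [True, False], "Type": ["A", "C"]})

# Boolean -> binary, and letter type -> numeric code
df["Holiday"] = df["IsHoliday"].astype(int)
df["Type_n"] = df["Type"].map({"A": 1, "B": 2, "C": 3})
```

`astype(int)` maps True/False to 1/0, and `map` applies the letter-to-number lookup in one vectorized pass.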
We are going to reduce the dataframe to a sample of 30,000 rows for easier visualization and manipulation, because the full dataframe requires significant computing power.
#save the current dataframe to another variable
train_df = train_data.copy(deep = True)
train_df = train_df.sample(n = 30000)
train_df.head()
t = TimeSeries(train_df, x='Date_y', y=['Weekly_Sales'],
               title="Timeseries", ylabel='Weekly Sales', legend=True)
output_file("weeklysales.html")
show(t)
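Since bokeh.charts was removed in later Bokeh releases, the TimeSeries call above may fail on a current install. A matplotlib equivalent, sketched here on made-up weekly figures rather than the real data, produces the same kind of plot:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up weekly sales figures for illustration only
dates = pd.date_range("2010-02-05", periods=5, freq="W")
sales = [24000, 22000, 26000, 51000, 23000]

fig, ax = plt.subplots()
ax.plot(dates, sales)
ax.set_ylabel("Weekly Sales")
ax.set_title("Timeseries")
```

With the real frame, `ax.plot(train_df['Date_y'], train_df['Weekly_Sales'])` would be the corresponding call.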
The graph above clearly shows a spike in sales during specific times of the year.
Draw a bar chart showing the average monthly sales for Walmart stores.
#create dataframe to hold the mean monthly sales
monthly_sales = train_df.groupby(['Month'])['Weekly_Sales'].mean()
monthly_sales = monthly_sales.to_frame()
monthly_sales
monthly_sales_bar = Bar(monthly_sales, values='Weekly_Sales', title="Mean Monthly Sales")
output_file("Mean_monthly_sales.html")
show(monthly_sales_bar)
From the data, December has higher mean weekly sales than any other month.
Monthly sales by type of store
monthly_sales_bytype = train_df.groupby(['Month','Type'])['Weekly_Sales'].mean()
monthly_sales_bytype = monthly_sales_bytype.to_frame()
monthly_sales_bytype.reset_index(inplace=True)
monthly_sales_bytype.head()
monthly_sales_bytype_bar = Bar(monthly_sales_bytype, label = 'Month', values='Weekly_Sales', group = 'Type', title="Mean Monthly Sales By Type of Store")
output_file("Mean_monthly_sales_bytype.html")
show(monthly_sales_bytype_bar)
From the bar chart above, it is clear that stores of type A have the highest average weekly sales, followed by stores of type B and then type C.
p = Scatter(train_df, x='Temperature', y='Weekly_Sales', title="Temperature vs Weekly Sales",
            xlabel="Temperature in Degrees Fahrenheit", ylabel="Weekly Sales")
output_file("temperature_sales_scatter.html")
show(p)
From the scatter plot above, there appears to be no relationship between temperature and weekly sales.
Is there a relationship between walmart store size and weekly sales?
store_size = train_df.groupby(['Size'])["Weekly_Sales"].mean()
store_size = store_size.to_frame()
store_size.reset_index(inplace=True)
store_size.head()
p = Scatter(store_size, x='Size', y='Weekly_Sales', title="Store size vs Weekly Sales",
xlabel="Store size in sqft", ylabel="Weekly Sales")
output_file("store_size_sales_scatter.html")
show(p)
From the scatter plot, there appears to be a linear relationship between store size and average weekly sales.
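To put a number on that apparent linear trend, one could fit a first-degree polynomial and check the correlation coefficient. A sketch with made-up (size, sales) pairs; substituting store_size['Size'] and store_size['Weekly_Sales'] would give the real check:

```python
import numpy as np

# Made-up (store size, mean weekly sales) pairs for illustration
size = np.array([40000.0, 80000.0, 120000.0, 160000.0, 200000.0])
sales = np.array([8000.0, 14000.0, 21000.0, 26500.0, 33000.0])

slope, intercept = np.polyfit(size, sales, 1)  # least-squares line
r = np.corrcoef(size, sales)[0, 1]             # Pearson correlation
```

A slope well above zero together with a correlation near 1 would back up the visual impression of linearity.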
The remaining features to examine are Fuel_Price, CPI, and Unemployment.
Is there a relationship between fuel price and Weekly sales?
p = Scatter(train_df, x='Fuel_Price', y='Weekly_Sales', title="Fuel Price vs Weekly Sales",
xlabel="Fuel Price in dollars/gallon", ylabel="Weekly Sales")
output_file("fuel_price_sales_scatter.html")
show(p)
Conclusion: From the plot above, there appears to be no relationship between fuel price and weekly sales.
Is there a relationship between Consumer Price Index(CPI) and Weekly sales?
p = Scatter(train_df, x='CPI', y='Weekly_Sales', title="Consumer Price Index vs Weekly Sales",
xlabel="Consumer Price Index", ylabel="Weekly Sales")
output_file("CPI_sales_scatter.html")
show(p)
From the scatter plot above, there appears to be no relationship between Consumer Price Index and weekly sales.
Is there a relationship between Unemployment and Weekly sales?
p = Scatter(train_df, x='Unemployment', y='Weekly_Sales', title="Unemployment vs Weekly Sales",
xlabel="Unemployment", ylabel="Weekly Sales")
output_file("unemployment_sales_scatter.html")
show(p)
From the scatter plot above, there appears to be no relationship between unemployment and weekly sales.
Is there a relationship between holidays and weekly sales?
holiday_df = train_df.groupby(['Holiday'])['Weekly_Sales'].mean()
holiday_df = holiday_df.to_frame()
holiday_df.reset_index(inplace=True)
holiday_df.head()
holiday_df_bar = Bar(holiday_df, values='Weekly_Sales', title="Mean Holiday Sales")
output_file("Mean_holiday_sales.html")
show(holiday_df_bar)
Conclusion: From the bar graph above, holiday weeks have higher average sales than non-holiday weeks.
Is there a relationship between Store department and weekly sales?
dept_df = train_df.groupby(['Dept'])['Weekly_Sales'].mean()
dept_df = dept_df.to_frame()
dept_df.reset_index(inplace=True)
dept_df.head()
dept_df_bar = Bar(dept_df, values='Weekly_Sales', title="Mean Department Sales")
output_file("Mean_department_sales.html")
show(dept_df_bar)
From the bar chart above, mean weekly sales vary strongly across store departments, suggesting a strong relationship between department and weekly sales.
Store size, holiday, store type, store department, and month of the year have a strong effect on weekly sales, while other factors such as temperature, Consumer Price Index (CPI), fuel price, and unemployment have little to no impact on weekly sales.
#import libraries that will be used for machine learning
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
From the exploratory data analysis, store size, holiday, store type, store department, and month of the year have an impact on a store's weekly sales, so we will only consider these factors when predicting weekly sales. I will use a KNN regressor: for each value to be predicted, it first finds the nearest neighbors based on store, department, holiday, month, and year, and then predicts weekly sales as a (distance-weighted) average of those neighbors' sales.
Below is a simple example of a KNN regressor predictor
#sample input X and y
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
#knn regressor model that averages the k = 2 nearest neighbors
neigh = KNeighborsRegressor(n_neighbors=2)
neigh.fit(X, y)
#predict for x = 1.5: the two nearest training points are x=1 and x=2,
#so the prediction is mean(0, 1) = 0.5
print(neigh.predict([[1.5]]))
How many nearest neighbors should we use in the regression? I will measure the accuracy of the weekly sales predictions and choose the optimum number of nearest neighbors, using K-fold cross-validation (via cross_val_score, which defaults to the regressor's R² score) to estimate accuracy.
# X holds the features used to predict the weekly sales
# y is the weekly sales column (the value to be predicted)
X = train_df[['Store', 'Dept', 'IsHoliday', 'Year', 'Month']]
y = train_df['Weekly_Sales']
#list keeps track of the score for each number of nearest neighbors
scores = []
for n in range(1, 12):
    neigh = KNeighborsRegressor(n_neighbors=n, weights='distance')
    score = cross_val_score(neigh, X, y, cv=10)
    scores.append([n, score.mean()])
#convert list to dataframe
scores = pd.DataFrame(data=scores, columns=['neighbors', 'cross_val_score'])
scores
Plot the results from the table to find the optimum number of nearest neighbors.
line = Line(scores, x='neighbors',y='cross_val_score', title="K-Nearest neighbors vs cross validation score", legend="top_left", xlabel = 'k-nearest neighbor', ylabel='cross validation score')
output_file("neighbors_score.html")
show(line)
Conclusion: From the graph above, the cross-validation score peaks when the number of nearest neighbors is 4.
What is the optimum number of folds for K-fold cross-validation? The first step is to measure prediction accuracy across a range of fold counts.
scores_kfold = []
for kfold_size in range(3, 30):
    neigh = KNeighborsRegressor(n_neighbors=4, weights='distance')
    score = cross_val_score(neigh, X, y, cv=kfold_size)
    scores_kfold.append([kfold_size, score.mean()])
scores_kfold_df = pd.DataFrame(data=scores_kfold, columns=['kfold_size', 'cross_val_score'])
scores_kfold_df
The second step is to plot the results.
line = Line(scores_kfold_df, x='kfold_size',y='cross_val_score', title="Kfold sizes vs cross validation score", legend="top_left", xlabel = 'kfold sizes', ylabel='cross validation score')
output_file("kfold_sizes_score.html")
show(line)
Observation: the score peaks when the K-fold size is 20, and the graph then levels off. Conclusion: use 20 as the K-fold size for cross-validation, because beyond 20 there is no significant improvement in the cross-validation score as the size increases.
from IPython.display import HTML
We can now use our K-nearest neighbors regressor model to predict future weekly sales.
# X holds the features used to predict the weekly sales
# y is the weekly sales column (the value to be predicted)
X = train_df[['Store', 'Dept', 'IsHoliday', 'Year', 'Month']]
y = train_df['Weekly_Sales']
neigh = KNeighborsRegressor(n_neighbors=4, weights='distance')
neigh.fit(X, y)
X_test = test[['Store', 'Dept', 'IsHoliday', 'Year', 'Month']]
test_predict = neigh.predict(X_test)
test_predict
test['Weekly_Sales_Predictions'] = test_predict
test
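If these predictions were to be submitted to Kaggle, the competition's sample submission (as I recall it; verify against the actual sample file) expects an Id of the form Store_Dept_Date plus a Weekly_Sales column. A sketch on a hypothetical two-row slice of the test frame:

```python
import pandas as pd

# Hypothetical two-row slice standing in for the test frame
test_sketch = pd.DataFrame({
    "Store": [1, 1],
    "Dept": [1, 2],
    "Date": ["2012-11-02", "2012-11-02"],
    "Weekly_Sales_Predictions": [25000.0, 14000.0],
})

# Build Store_Dept_Date ids and rename the prediction column
submission = pd.DataFrame({
    "Id": test_sketch["Store"].astype(str) + "_"
          + test_sketch["Dept"].astype(str) + "_"
          + test_sketch["Date"],
    "Weekly_Sales": test_sketch["Weekly_Sales_Predictions"],
})
# submission.to_csv("submission.csv", index=False)
```

Running the same construction on the real test frame and writing it with to_csv would produce an uploadable file.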
From the exploratory data analysis we realized that stores of type A produce significantly higher sales than type B and type C stores. In order to increase sales, we should consider focusing our efforts on type A stores, since they are the company's biggest revenue driver.
HTML(filename='Mean_monthly_sales_bytype.html')